
INTRODUCTION 

Augmented Reality (AR) is a technique by which
computer-generated signals synthesize impressions that are
made to coexist with the surrounding real world as perceived
by the user. Human smell, taste, touch and hearing can all be
augmented, but most commonly AR refers to human
vision being overlaid with information not otherwise readily
available to the user. A correct calibration is important on an
application level, ensuring that e.g. data labels are presented at
correct locations, but also on a system level to enable display
techniques such as stereoscopy to function properly
[SOURCE]. Calibration methodology is thus vital to AR and an
important research area. While great achievements have
already been made, some properties of current
calibration methods for augmenting vision do not
translate from their traditional use in automated camera
calibration to their use with a human operator. This paper uses a
Monte Carlo simulation of a standard direct linear
transformation camera calibration to investigate how
user-introduced head orientation noise affects parameter
estimation during the calibration of an optical
see-through head-mounted display.

OVERVIEW AND RELATED WORK 
Display Techniques 

The display techniques by which the human vision is 
augmented are commonly divided into head-slaved helmet 
mounted displays (HMD) and stationary head-up displays 
(HUD). Both can be further subdivided into video see-through 
(VST) and optical see-through (OST) devices. In the case of 
VST, the user’s visual impressions are relayed by a video 
camera where the signal is composited with a data layer before 
it is presented to the user on an opaque screen. VST carries 
several benefits in terms of ease of calibration: A frame from 
the camera’s auxiliary video signal can be routed through a 
processing segmentation and corner detection procedure to 
simultaneously acquire hundreds of correspondence points 
with precision only limited by image blur and pixel 
quantization (Hartley & Zisserman, 2000). Said auxiliary 
signal can also be used to objectively estimate the calibration 
quality. Other benefits relate to the fact that the optics of the
eye and the camera are separated: in the monoscopic case the
requirement for correct screen positioning relative to the
user’s eye is relaxed, as the eye’s view vector and the screen’s
center, the principal point, need not be aligned. While
harder to calibrate, OST has the benefit that it does not reduce 
the human vision to the capabilities of the camera in terms of 
acuity, dynamic range, and field of view (FOV) as the data 
layer is composited through an optical combiner, usually a 
slanted half-silvered mirror and a system of lenses. The direct 
optical path between the real world and the eye also provides 
the safety to revert to non-augmented vision in case of system 
failure. OST devices are, however, harder to calibrate for
exactly the opposite reasons: the optical system
of the screen must be carefully aligned with that of the user’s
eye, which is a high demand when the calibration depends
on the user’s subjective judgment in the absence of an
auxiliary video signal. This paper focuses on calibration
methods involving OST HMD for applications where human
vision cannot be substituted with a camera view.

Computer Models 

To be able to correctly merge the real and virtual world 
during user interaction with a dynamic scene, the AR system 
maintains a computer model to represent the location of real 
and virtual objects. The spatial relationships are normally 
modeled using linear transformation matrices containing 
rotations as well as translations through the use of 
homogeneous coordinates. As 4-by-4 matrices, they can be 
aggregated through multiplication to symbolize the traversal 
through local coordinate systems to describe the exact location 
of surrounding objects relative to the user’s eye, see Figure 1 
and (9). At the eye point the user’s view is traditionally 
modeled as a pinhole camera. Assuming no radial distortion in 
the optics this subsystem can be modeled as two additional 
matrices holding extrinsic and intrinsic camera parameters 
which together conveniently can be multiplied into the matrix 
aggregation. In the total matrix aggregation there is a subset of
matrices that update dynamically due to user and object
movement, and a subset that remains static because the spatial
relationship between the objects is fixed. By singling out the
subset of static matrices and aggregating them separately, the
calibration procedure becomes the task of populating the
elements of the aggregated matrix instead of determining each
measurement individually. This is preferable since some of the
measurements needed for an accurate calibration model are
hard to obtain directly. In the case of an OST display, the offset
between the tracker sensor and the eye is an example of such a
measurement: the casing of the tracker hides the exact
location of its sensor origin, and similarly the skull and lobe hide
the internal workings of the eye. The calibration
method must therefore estimate this type of measurement implicitly
with the help of other measurable distances in the model.
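
The separation of static and dynamic matrices can be sketched in a few lines of Python. This is only an illustration under assumed values, not the implementation used in this paper: the helper names (`matmul`, `translation`, `rotation_y`, `apply`, `head_pose`) and all numbers are hypothetical, and points are treated as homogeneous column vectors so that the rightmost matrix is applied first.

```python
import math

def matmul(a, b):
    """Multiply two 4-by-4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    """Homogeneous 4-by-4 translation matrix."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def rotation_y(deg):
    """Homogeneous 4-by-4 rotation about the y axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]

def apply(m, p):
    """Apply a 4-by-4 matrix to a homogeneous column vector."""
    return [sum(m[i][k] * p[k] for k in range(4)) for i in range(4)]

# Static subset (hypothetical values): a fixed sensor-to-eye offset
# followed by a small fixed rotation. Aggregated once.
static_a = translation(0.02, -0.08, -0.05)
static_b = rotation_y(3.0)
static = matmul(static_b, static_a)

def head_pose(yaw_deg, x, y, z):
    """Dynamic subset: a tracker-reported pose that changes every frame."""
    return matmul(rotation_y(yaw_deg), translation(x, y, z))

landmark = [0.5, 0.2, -2.0, 1.0]
dynamic = head_pose(10.0, 0.1, 0.0, 0.3)

# Traversing the matrices one by one and using the pre-aggregated
# static matrix give the same point.
step_by_step = apply(static_b, apply(static_a, apply(dynamic, landmark)))
aggregated = apply(matmul(static, dynamic), landmark)
```

Because matrix multiplication is associative, a calibration only needs to populate the elements of `static` as a whole; the individual offsets that went into it never have to be measured separately.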

Standard Camera Calibration Procedure 

Common calibration procedures usually spring from
camera decomposition (Ganapathy, 1984), camera
resectioning (Hartley & Zisserman, 2000), camera pose
estimation (Haralick, 1989), and direct linear transform (DLT)
problems (Abdel-Aziz & Karara, 1971), in which the
relationship between landmarks of known locations in the
surrounding real world, $p_w$, and points of known pixel
coordinates on the screen, $p_s$, is used to determine a 3-by-4
camera matrix $T$ (1). These point correspondences can be
expressed as a system of homogeneous linear equations (3) in
which $x$ is a vector of the elements in matrix $T$, and $A$ is the
result of the matrix multiplication (2) after the perspective divide,
$w$, has been substituted for; see Appendix A in (Sutherland,
1974) for details. The values in $p_w$ and $p_s$ are usually
normalized to a common order of magnitude to condition the
matrix $A$ and reduce the effect of noise (Hartley, 1997) (Wan &
Xu, 1996). The minimum number of correspondence points
depends on how the degrees of freedom (DOF) of the
calibration model have been parameterized, but at least six
points are needed to solve for the 12 entries in $T$, since each
correspondence contributes two equations. Generally
more points are gathered to further mitigate the effect of noise.
This results in $A$ being non-square, thus prompting the use of
a Moore-Penrose pseudoinverse by which $A$ is factorized into
two bases ($U$, $V$) and a diagonal matrix ($\Sigma$) using singular
value decomposition (SVD) (4). The eigenvalue calculations in
SVD effectively perform a least squares approximation. Thus
the last column in the base matrix $V$, which by convention
corresponds to the smallest singular value in $\Sigma$, can be
interpreted as the calibration matrix $T$ (5) that projects
landmark coordinates onto screen coordinates with the
smallest residual between screen points and corresponding
landmarks as seen by the user.

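$$ p_s = T p_w \tag{1} $$

$$ w \begin{bmatrix} u & v & 1 \end{bmatrix}^T =
\begin{bmatrix}
T_{1,1} & T_{1,2} & T_{1,3} & T_{1,4} \\
T_{2,1} & T_{2,2} & T_{2,3} & T_{2,4} \\
T_{3,1} & T_{3,2} & T_{3,3} & T_{3,4}
\end{bmatrix}
\begin{bmatrix} x & y & z & 1 \end{bmatrix}^T \tag{2} $$

$$ A x = 0 \tag{3} $$

$$ A = U \Sigma V^T \tag{4} $$

$$ T =
\begin{bmatrix}
V_{1,12} & V_{4,12} & V_{7,12} & V_{10,12} \\
V_{2,12} & V_{5,12} & V_{8,12} & V_{11,12} \\
V_{3,12} & V_{6,12} & V_{9,12} & V_{12,12}
\end{bmatrix} \tag{5} $$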
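
As an illustration of the DLT steps (1) to (5), the following is a minimal sketch using NumPy. It is not the MATLAB implementation used in this paper, and it makes several assumptions: the function names `dlt` and `project`, the synthetic camera values, and the row-by-row ordering of the elements of $T$ in $x$ (which may differ from the ordering in (5)) are all chosen here for illustration. The normalization step (Hartley, 1997) is omitted for brevity.

```python
import numpy as np

def dlt(world_pts, screen_pts):
    """Estimate a 3-by-4 matrix T from n >= 6 point correspondences."""
    rows = []
    for (x, y, z), (u, v) in zip(world_pts, screen_pts):
        p = [x, y, z, 1.0]
        # Two equations per correspondence once w has been eliminated:
        #   T1.p - u * T3.p = 0   and   T2.p - v * T3.p = 0
        rows.append(p + [0.0] * 4 + [-u * c for c in p])
        rows.append([0.0] * 4 + p + [-v * c for c in p])
    A = np.array(rows)
    # Rows of Vt are the right singular vectors; the last one belongs to
    # the smallest singular value and minimizes |Ax| subject to |x| = 1.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)  # assumption: x lists T row by row

def project(T, world_pts):
    """Project 3D points with T and apply the perspective divide."""
    pts = np.asarray(world_pts, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    h = homogeneous @ T.T
    return h[:, :2] / h[:, 2:3]

# Synthetic, noise-free data from a hypothetical camera matrix.
K = np.array([[950.0, 0.0, 320.0], [0.0, 950.0, 240.0], [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.02], [-0.08], [0.05]])])
T_true = K @ Rt

rng = np.random.default_rng(0)
world = rng.uniform([-0.5, -0.4, 1.5], [0.5, 0.4, 3.0], size=(12, 3))
screen = project(T_true, world)

T_est = dlt(world, screen)
reprojection_error = np.max(np.abs(project(T_est, world) - screen))
```

The estimate is only defined up to scale, so `T_est` should be compared with `T_true` through reprojection or after normalizing both matrices, not element by element.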


At this point, some calibration procedures adjust for
non-linear lens effects by using matrix $T$ as initial values for a
Levenberg-Marquardt (LM) optimization procedure further
refining $T$ (Tsai, 1987) (Zhang, 2000). Known measurements
can be used as soft parameter constraints for the LM, and
further robustness to noise can be provided by weighting the
optimization cost function to decrease the importance of outliers
(Hartley & Zisserman, 2000).

The matrix $T$ can further be divided into extrinsic, $[R|t]$,
and intrinsic, $K$, camera parameters (6) with RQ-decomposition
using Givens rotations. At this stage the offset from the tracker
sensor origin to the center point of the eye is accessible
through $t$ (7), and $R$ describes the rotation of the screen. $K$
gives the focal length, $\alpha$, which in turn holds the distance to the
screen (in meters), $f$, if the pixel ratio (pixels per meter) is
known (8). With knowledge of the screen resolution, this
information also gives the theoretical FOV. The practical FOV
is, however, dependent on how well the eye aligns with the exit
pupil defined by the principal point, $(x_0, y_0)$ [SOURCE].

$$ T = K[R|t] \tag{6} $$
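
The decomposition in (6) can be sketched as follows. This is a hedged illustration, not the procedure used in this paper: instead of Givens rotations, the RQ factorization is obtained here from NumPy's QR factorization of a row-reversed transpose, and the function names `rq3` and `decompose`, the synthetic camera values, the pixel ratio, and the assumed relation $f = \alpha$ / pixel ratio are all assumptions made for the example.

```python
import numpy as np

def rq3(M):
    """RQ-decompose a 3-by-3 matrix: M = R @ Q, R upper triangular."""
    P = np.flipud(np.eye(3))           # row-reversal permutation
    Q_, R_ = np.linalg.qr((P @ M).T)
    R = P @ R_.T @ P                   # upper triangular factor
    Q = P @ Q_.T                       # orthogonal factor
    S = np.diag(np.sign(np.diag(R)))   # make the diagonal of R positive
    return R @ S, S @ Q

def decompose(T):
    """Split T (known up to scale) into K, R and t with T ~ K[R|t]."""
    if np.linalg.det(T[:, :3]) < 0:    # remove a negative overall scale
        T = -T
    K_scaled, R = rq3(T[:, :3])
    t = np.linalg.solve(K_scaled, T[:, 3])
    K = K_scaled / K_scaled[2, 2]      # normalize so that K[2, 2] = 1
    return K, R, t

# Hypothetical ground truth, multiplied by an arbitrary scale factor.
K_true = np.array([[950.0, 0.0, 320.0], [0.0, 950.0, 240.0], [0.0, 0.0, 1.0]])
a = np.radians(5.0)
R_true = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.02, -0.08, 0.05])
T = -2.5 * K_true @ np.hstack([R_true, t_true[:, None]])

K, R, t = decompose(T)

# Assumed relation: distance to screen = focal length / pixel ratio.
pixel_ratio = 19000.0                  # pixels per meter, hypothetical
focal_m = K[0, 0] / pixel_ratio
fov_h_deg = np.degrees(2 * np.arctan(640 / (2 * K[0, 0])))
```

The theoretical horizontal FOV follows from the focal length in pixels and the horizontal resolution (640 pixels assumed for VGA).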
This procedure seems well documented and
straightforward in the case of camera calibration, but how about its
use with a human operator for the purpose of calibrating an OST
HMD?

Human Operator Limitations 

To calibrate an OST HMD according to a standard 
camera calibration procedure the human operator must 
manually and subjectively align at least six landmarks with 
their corresponding pixel coordinates on the transparent screen 
through what is known as a boresight exercise. This presents 
challenges in terms of a) simultaneous correspondence point 
acquisition, b) human alignment precision, and c) the use of 
assisting technology. 

a) In the case of VST HMD, a video frame can be said to
provide a “snapshot” in which hundreds of correspondence
points can be collected simultaneously, but with OST a human
operator will inevitably move between alignments,
thereby preventing the boresight lines from converging at a
single eye point in space. A solution to this challenge is to use
only one landmark instead of six, but reference it in the head
coordinate system according to the Single Point Active Alignment
Method (SPAAM) (Tuceryan & Navab, 2000). Since the
tracker sensor now serves as the origin, the user can collect an
arbitrary number of correspondence points, moving freely
between alignments, as the boresight lines will all converge at
the end of vector $t$ (7).

b) Human alignment precision in an OST HMD is
predominantly dependent on head rotation precision, which has
been reported to be 0.13° (Nicholson, 1966), 0.9° (Verona,
1978), and 0.04° (Wells & Griffins, 1987) standard deviation.
Most recently Axholt et al. reported 0.25° precision for 12
standing subjects using an OST HMD with VGA resolution
through a 28° by 37° FOV, which effectively converts to an
average boresight misalignment of 4.3 pixels. However, when
the effect of head translation was removed, subjects exhibited
sub-pixel alignment precision on the order of human visual
acuity (0.016°, 0.2 pixel). Hence head rotations are thought to
compensate for postural sway (Axholt et al., 2009), but
unfortunately the translation compensation is only possible in
dynamic modeling using time series and is not an option for
the static standard camera calibration procedure.

c) In the presence of noise, calibration quality generally
improves with the number of correspondence points (Hartley
& Zisserman, 2000). However, due to human fatigue it is
reasonable to believe that there exists an optimal calibration
quality as a trade-off between the number of correspondence points
and the deterioration of user alignment over the time it takes to
make the alignments. To reduce the effect of noise in
“time-consuming and error-prone human measurements”,
Gilson et al. (Gilson et al., 2008) replace the user’s eye with a
camera during calibration to estimate the parameters of the
OST HMD with techniques similar to those of VST calibration.
The subsequent evaluation is, however, made with the same
camera, and not with a human eye, and therefore does not
illustrate the effects of mismatching camera and eye position.
Owen et al. (Owen et al., 2004) also use camera-aided
calibration and address the challenge of switching the camera
for a human eye by dividing the calibration procedure into two
phases, one for intrinsic and one for extrinsic parameters.
However, together with Genc et al. (Genc et al., 2002), all
three works rely on the assumption that intrinsic parameters only need
to be estimated once, and do not change between user
sessions. This assumption unfortunately only holds for an ordinary
camera with a rigid camera housing, but not for an OST HMD, as
the location of the eye, after it has replaced the camera, does
not necessarily coincide with the apex of the frustum defined
by the intrinsic parameters. Thus the focal length, $f$, and principal
point, $(x_0, y_0)$, also need to be adjusted between user
sessions. In conclusion, the technique of separating extrinsic
and intrinsic parameters during calibration is not a viable path
for reducing the number of alignments made by a human
operator using an OST HMD.
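
The conversion from angular precision to pixels mentioned under b) can be reproduced with simple arithmetic. The sketch below assumes a linear mapping between angle and pixels, and assumes that the 37° extent spans the 640-pixel axis and the 28° extent spans the 480-pixel axis of the VGA display; the function name `deg_to_pixels` is hypothetical.

```python
def deg_to_pixels(angle_deg, fov_deg, resolution_px):
    """Convert an angle to pixels assuming a constant pixels-per-degree ratio."""
    return angle_deg * resolution_px / fov_deg

# 0.25 degrees of head rotation noise on a VGA display with 28 by 37 degree FOV
horizontal_px = deg_to_pixels(0.25, 37.0, 640)
vertical_px = deg_to_pixels(0.25, 28.0, 480)
```

Under these assumptions both axes give roughly 4.3 pixels, in line with the figure quoted for the 0.25° precision.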

Problem Statement 

Under the hypothesis that an OST HMD must be calibrated
by a human operator, through a process that cannot be camera
aided, we wish to investigate the limits of existing calibration
procedures with respect to the robustness of parameter estimation to
human alignment noise in order to understand, and possibly
compensate for, their limitations.

METHODS 

Modeling 

In a first step a standard DLT camera calibration
according to SPAAM (Tuceryan & Navab, 2000) was
modeled using an object-oriented node tree structure
implemented in MATLAB R2006b to manage the relationship
between objects, transformation and projection matrices as
depicted in Figure 1 and described in (9). A frustum object was
set to roughly model a Kaiser ProView 50ST with VGA
resolution, 28° by 37° FOV and a projection plane located
0.05 m in front of the user’s left eye. Each simulation iteration
started by randomizing the user’s head rotation and position in
front of the landmark, $p_{world}$. Then a Nelder-Mead (NM)
optimization algorithm estimated the optimal orientation of
$T_{T-H}$ to minimize the distance between a crosshair, $p_{screen}$, and
the landmark as projected on the screen according to (9). $T_{H-E}$
simulates the offset between the tracker sensor and the user’s eye,
and thereby constitutes the extrinsic parameters of a standard
camera calibration procedure. Similarly, $P_{E-S}$ constitutes the intrinsic
parameters. Since both are static when the user is wearing the
HMD, they may be multiplied together into $T_{CAL}$ (10), which
corresponds to the camera calibration matrix $T$ (1) to be
estimated.

Figure 1: Objects, transformations and projections used in simulation.
(Figure labels: Tracker Coordinate System, World Coordinate System.)

$$ p_{screen} = T_{W-T} \, T_{T-H} \, T_{H-E} \, P_{E-S} \, p_{world} \tag{9} $$

$$ T_{W-T} = I, \quad T_{H-E} \, P_{E-S} = T_{CAL} \tag{10} $$

$$ p_{screen} = T_{T-H} \, T_{CAL} \, p_{world} \tag{11} $$

In a second step, the simulation was extended with a
normalization procedure as described in (Hartley, 1997), and an
LM optimization procedure implementing quadratic, pseudo-Huber,
and Blake-Zisserman cost functions further refining $T$
(Hartley & Zisserman, 2000). It was at this point that the effect of
human noise in a standard camera calibration became
noticeable, as the LM search for an appropriate $T$ would fail at
surprisingly low noise levels regardless of cost function,
which in turn prompted further changes to the model for
investigation.

Thus, in a third step, the NM optimization, which was
implemented to model human operator limitation a) using
SPAAM, was excluded in favor of faster calculations
enabling a Monte Carlo simulation. With no drawback to the
final result, the frustum was thus made stationary and pointed
towards not one but a collection of landmarks whose ideal
projections were precomputed using (9). The LM step
was also excluded, so that solely the effect of human
noise on a standard DLT camera calibration was studied.


Simulation 

The independent variables of the simulation were: 1)
number of correspondence points {6, 9, 12, 16, 20, 42, 81}
distributed in a grid pattern with even spacing throughout the
display surface, 2) human noise distribution {fixed range,
white noise, Gaussian, Weibull} parameterized using range or
99.9% probability, 3) human noise magnitude defined as pixel
range {0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 4, 5, 6} introduced as
perturbations of $p_{screen}$ in a random (white) direction. The
simulation was run with 1,000 iterations per combination of
independent variables to collect the dependent variables $T_{CAL}$
and normalized calibration error (NCE) (Weng et al., 1992).
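
One way to generate such perturbed screen points is sketched below. The exact parameterization used in the simulation is not given here, so everything in this example is an assumption: the function names (`grid`, `noise_magnitude`, `perturb`), the interpretation of "fixed range" as a constant magnitude, the mapping of the range to the 99.9% quantile for the Gaussian and Weibull models, and the Weibull shape of 1.5 are all hypothetical choices.

```python
import math
import random

def grid(cols, rows, width=640, height=480):
    """Evenly spaced grid of screen points (pixel coordinates)."""
    return [((i + 0.5) * width / cols, (j + 0.5) * height / rows)
            for j in range(rows) for i in range(cols)]

def noise_magnitude(model, pixel_range, rnd):
    """Draw a non-negative noise magnitude in pixels for one alignment."""
    if model == "fixed":
        return pixel_range
    if model == "white":
        return rnd.uniform(0.0, pixel_range)
    if model == "gaussian":
        # 99.9% of a zero-mean Gaussian lies within about 3.29 sigma
        sigma = pixel_range / 3.29
        return abs(rnd.gauss(0.0, sigma))
    if model == "weibull":
        shape = 1.5  # hypothetical shape parameter
        # choose the scale so that the 99.9% quantile equals pixel_range
        scale = pixel_range / (-math.log(0.001)) ** (1.0 / shape)
        return rnd.weibullvariate(scale, shape)
    raise ValueError("unknown noise model: " + model)

def perturb(point, model, pixel_range, rnd):
    """Shift a screen point by a random magnitude in a uniformly random direction."""
    r = noise_magnitude(model, pixel_range, rnd)
    theta = rnd.uniform(0.0, 2.0 * math.pi)
    return (point[0] + r * math.cos(theta), point[1] + r * math.sin(theta))

rnd = random.Random(0)
ideal = grid(5, 4)  # 20 correspondence points
noisy = [perturb(p, "gaussian", 6.0, rnd) for p in ideal]
```

In a Monte Carlo run, `noisy` would replace the ideal screen points as input to the DLT estimation, repeated for each combination of point count, noise model and magnitude.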

RESULTS 


Figure 3 shows how the translation component of $T_{H-E}$,
i.e. the estimate of the relative location between the head tracker
sensor and the user’s eye, $t$, varies with user alignment noise
and number of correspondence points. Notably, the estimation
error is particularly large along the user’s line of sight (z axis).


[Figure 3 plot: "Variability in Eye Point Translation as a Function of Alignment Noise and Number of Correspondence Points"; panels: X Translation, Y Translation, Z Translation]

Figure 3: Data from Monte Carlo simulation illustrating how variability in
eye point translation vector $t$ increases with alignment noise. The data in
the plot was collected in a right-handed coordinate system with negative z
along the user’s viewing direction, using a fixed range model for 1,000
iterations. Boxes denote quartiles. Outliers, marked with red + signs, are
defined as values lying more than 1.5 times the interquartile range beyond
the quartiles.



                 X translation     Y translation     Z translation
Noise model      20 p     81 p     20 p     81 p     20 p     81 p
Fixed *          0.099    0.052    0.103    0.051    0.548    0.259
White            0.059    0.030    0.051    0.030    0.324    0.137
Gaussian         0.034    0.017    0.030    0.017    0.1731   0.089
Weibull          0.017    0.009    0.016    0.009    0.0929   0.041

Table 1: Interquartile range (in meters) for eye point translation as a
function of 20 or 81 correspondence points and four various user
alignment noise models of 6 pixel range magnitude.


Similar results are found in the rotational component of
$T_{H-E}$, i.e. the orientation of the screen relative to the user’s eye, $R$.
Figure 2 shows that there is a 50% probability of estimating
the screen rotation accurately within ±1° if the user can
correctly align 81 correspondence points within a range of six
pixels. Similar to the trend in Table 1, the 50% probability
accuracy is somewhat better for the other noise models.

[Figure 2 plot: "Variability in Screen Orientation as a Function of Alignment Noise and Number of Correspondence Points"; panels: 9 points, 20 points, 81 points]

Figure 2: Variability in screen orientation $R$ as a function of alignment
noise and number of correspondence points based on a fixed range noise
model for 1,000 iterations. Orientation is reported as the dot product
cosine angle between the original screen orientation and estimated screen
orientation after simulated calibration. Note the varying scale on the y
axis.

The variability of the intrinsic parameters is shown in
Table 2. Through division by the pixel ratio (8), the values are
expressed as measurements (in meters) within which there is a 50%
probability of finding the estimated frustum apex and subsequently
the eye point.



                 Principal Point X /     Principal Point Y /
                 Focal Length X          Focal Length Y
Noise model      20 p     81 p           20 p     81 p
Fixed            0.013    0.007          0.014    0.007
White            0.008    0.004          0.008    0.003
Gaussian         0.004    0.002          0.004    0.002
Weibull          0.002    0.001          0.002    0.001

Table 2: Interquartile ranges (in meters) for the intrinsic parameters
as a function of 20 or 81 correspondence points and four various
user alignment noise models of 6 pixel range magnitude.


The boxes in Figure 3 represent the parameters’ range
within which there is a 50% probability of repeatability. Table 1
presents the numerical values for $T_{H-E}$ relative to the four noise
models. For example, a user exhibiting head orientation noise
equivalent to a six pixel fixed range would have to align 81
points for a 50% probability of the eye point being correctly
estimated within 13 cm (=0.259/2) of its true location (*).
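
The reading of an interquartile range as a 50% interval can be illustrated with synthetic numbers. The sketch below does not use data from the simulation: the sample values are drawn from an arbitrary Gaussian purely for illustration, the function name `interquartile_range` is hypothetical, and the interpretation of half the interquartile range as a distance from the true location assumes a distribution that is roughly symmetric about it.

```python
import random
import statistics

def interquartile_range(samples):
    """Distance between the first and third quartile of the samples."""
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    return q3 - q1

# Worked value from Table 1: z translation, fixed range model, 81 points
half_width_m = 0.259 / 2  # 0.1295 m, i.e. roughly 13 cm

# Synthetic stand-in for 1,000 simulated eye point estimates along z
rnd = random.Random(0)
z_estimates = [rnd.gauss(0.0, 0.2) for _ in range(1000)]

iqr = interquartile_range(z_estimates)
center = statistics.median(z_estimates)
inside = sum(abs(z - center) <= iqr / 2 for z in z_estimates) / len(z_estimates)
```

For a roughly symmetric sample, `inside` comes out close to one half, which is the sense in which the box of a box plot marks the 50% interval.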


DISCUSSION 

This paper was motivated by the fact that investigations of camera
calibration accuracy usually model lower levels of noise than
those found in a human operator (Sun & Cooperstock, 2006).
To bridge the data gap between camera and human
performance, a controlled environment, such as a Monte Carlo
simulation, was called for.


The results showed that, even with a fair number of
alignments, a standard camera calibration procedure is
unlikely to estimate the user’s eye point with repeatable accuracy
in the presence of head rotation noise levels found in the
literature. However, as the predominant error is along the
user’s viewing direction, there may be some instances in
which the performance of the standard camera calibration may
be acceptable: if the eye point is erroneously estimated along
the user’s line of sight, the registration error will manifest
itself as a scaling error as long as the user looks straight ahead.
However, as the head is turned, the frustum will pivot around
the erroneous eye point and introduce a lateral registration
error as the frustum is shifted sideways. The current findings
should be taken into account for applications with a fixed user,
e.g. in vehicles and observation posts, where the user explores
primarily with head rotations.

To determine if alignment distribution had an effect on 
parameter robustness, one fixed range and three noise models 
were implemented and set to have range as a common 
parameter. While white and Gaussian noise increase their 
spread with increasing range, the Weibull distribution, popular 
for modeling cluster concentrations, shifts its mode towards 
zero for increasing range. This should be kept in mind when 
comparing the results. 